Chapter 19. XML and Cocoon

So far we have talked about different ways of writing scripts, worrying more about the logic they contain than their content. Working with XML and Cocoon takes a rather different tack, defining transformation pathways from a generic XML format to destination formats, typically HTML but possibly others. Using this approach, a single set of documents can be used to generate a variety of different representations appropriate to different devices or situations.

19.1 XML

Like HTML, Extensible Markup Language (XML) uses markup (elements, attributes, comments, etc.) to identify content within a document. Unlike HTML, XML lets developers create their own vocabularies to describe that content, encouraging a much greater separation of content from presentation. When we wrote this page, we put the chapter title at the top right hand corner of a blank page: "XML and Cocoon." Then we started on the text:

So far we have talked about different ways of writing scripts, worrying more about the logic they contain than their content...

If you put this book down open and come back to it tomorrow, a glance at the top of the page reminds you of the subject of this chapter, and a glance at the top of the paragraph reminds you where we have got to in that chapter.

It is not necessary to explain what these typographic page elements are telling you because we have all been reading books for years in a civilization that has had cheap printing and widespread literacy for half a millennium, so we don't even think about the conventions that have developed.

Putting the right message in the right sort of type in the right place on the page in order to convey the right meaning to the reader was originally a specialized technical job done by the book editor and the printer.

Now, computing is changing all that. We typeset our own manuscripts with the help of publishing packages. We publish our own books without the help of trained editors. We don't have to bother with the book format: we publish our own web pages by the billion, often without recourse to any standards of layout, intelligibility, or even sanity. Since computer data has no inherent format to tell us what it means, there is, and has been for a long time, an urgent need for some sort of markup language to tell us what we are looking at.

A start was made on solving the problem many decades ago with the Standard Generalized Markup Language (SGML). This evolved informally for a long time and then was accepted by the International Organization for Standardization (ISO) in 1986. SGML has been taken up in a number of industries and used to define more specific tag languages: ATA-2100 for aircraft maintenance manuals, PCIS in the semiconductor industry, and DocBook for software documentation in the computer industry.

HTML is an application of SGML. It uses a very small subset of SGML's functionality with a single vocabulary. Its limitations are growing clearer, even though millions of lines of it are in use every second of the day around the world. The trouble is that HTML simply says how text should appear on the client's computer screen. You might be a nurse looking at a web page containing a patient's medical record. The patient is lying unconscious on a stretcher and desperately needs penicillin. Is she allergic to the drug? The word "penicillin" might appear 20 times in her record: she was given it on various dates scattered here and there. Did one of these turn out badly? Is there a note somewhere about allergies? You might have to read a hundred pages, and you haven't the time. What you need is a standard medical markup:

<allergies><drug-reactions>....</drug-reactions></allergies>

and a quick way of finding it, probably through an applet.
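
To make the point concrete, here is a minimal sketch in Python of how a program might pull the allergy information out of a marked-up record. The record itself is invented for illustration: only <allergies> and <drug-reactions> come from the text above, and every other element name, attribute, and value is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical patient record. Only <allergies> and <drug-reactions> come from
# the text; every other element and attribute name is invented for this sketch.
RECORD = """\
<patient>
  <name>Jane Doe</name>
  <treatment date="2001-03-14"><drug>penicillin</drug></treatment>
  <treatment date="2001-09-02"><drug>penicillin</drug></treatment>
  <allergies>
    <drug-reactions>
      <reaction drug="penicillin">severe rash</reaction>
    </drug-reactions>
  </allergies>
</patient>
"""


def drug_reactions(record_xml):
    """Return (drug, note) pairs found under <allergies><drug-reactions>."""
    root = ET.fromstring(record_xml)
    return [
        (reaction.get("drug"), (reaction.text or "").strip())
        for reaction in root.findall("./allergies/drug-reactions/reaction")
    ]


def is_allergic(record_xml, drug):
    """True if the record lists a reaction to the named drug."""
    return any(name == drug for name, _ in drug_reactions(record_xml))


if __name__ == "__main__":
    print(drug_reactions(RECORD))
    print(is_allergic(RECORD, "penicillin"))
```

The treatments that merely mention penicillin are ignored, because the search follows the markup rather than the words.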

In principle, SGML could do what is wanted on the Web. Unfortunately, it is very complicated; it was first specified in the days when every byte mattered, so it is full of cunning shortcuts, it is too big for developers to learn, and it's too big for browsers to implement. So XML is a cut-down version that does what is needed and not too much more. XML requires much stricter attention to document structure but offers a much wider choice of vocabularies in return.

On the other hand, XML differs from HTML in that it is a completely generalized markup language. HTML has a small list of prespecified tags: <HEAD>, <H2>, <A HREF...>, etc. XML has no prespecified tags at all. Its tags are invented by its users as necessary to define the information that a page will carry, as with <allergies><drug-reactions> earlier. The tags to be used are stored in a Document Type Definition (DTD) (soon to be replaced by XML Schemas). The DTD also defines the structure of the document as a tree: <book>s contain <chapter>s and <chapter>s contain <paragraph>s. A <paragraph> never contains a <book>. A <drug-reactions> element comes inside the more general <allergies>, and so on. It is technically quite simple to write a DTD, but in most applications much more work goes into getting the agreement of other people about the structure of the document and the types of information that need to be in it. (For more information on writing DTDs, see Erik Ray's Learning XML [O'Reilly, 2000].)
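
The tree rules a DTD expresses can be imitated in a few lines of Python. This is only a sketch of the idea, not DTD validation: the ALLOWED_CHILDREN table is our own stand-in for a DTD, and a real DTD also governs ordering, repetition, and attributes.

```python
import xml.etree.ElementTree as ET

# Containment rules in the spirit of a DTD: each element name maps to the set
# of child elements it may contain. This table is an assumption made for the
# example, not something generated from a real DTD.
ALLOWED_CHILDREN = {
    "book": {"chapter"},
    "chapter": {"paragraph"},
    "paragraph": set(),
}


def structure_errors(element, path=""):
    """Return a message for every element that breaks the containment rules."""
    here = f"{path}/{element.tag}"
    if element.tag not in ALLOWED_CHILDREN:
        return [f"{here}: unknown element"]
    errors = []
    for child in element:
        if child.tag not in ALLOWED_CHILDREN[element.tag]:
            errors.append(f"{here}: <{child.tag}> not allowed here")
        else:
            errors.extend(structure_errors(child, here))
    return errors


GOOD = "<book><chapter><paragraph>text</paragraph></chapter></book>"
BAD = "<book><chapter><paragraph><book/></paragraph></chapter></book>"

if __name__ == "__main__":
    print(structure_errors(ET.fromstring(GOOD)))
    print(structure_errors(ET.fromstring(BAD)))
```

The second document is rejected because a <paragraph> tries to contain a <book>.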

The idea of XML goes way beyond formatting and displaying information, though that is a very useful consequence. It is a way of handling information to produce other information. The usefulness of this approach is well explained by Brett McLaughlin in his Java and XML.[1] He uses as an illustration the process of selling a network line to a customer.

...When a network line, such as a DSL or T1, is sold to a customer, a variety of things must happen. The provider of the line, such as UUNet, must be informed of the request for a new line. A router must be configured by the CLEC and the setup of the router must be coordinated with the Internet service provider. Then an installation must occur, which may involve another company if this process is outsourced. This relatively common and simple sale of a network line already involves three companies. Add to this the technical service group for the manufacturer of the router, the phone company for the customer's other communication services, and the InterNIC to register a domain, and the process becomes significant.

This rather intimidating process can be made extremely simple with the use of XML. Imagine that the original request for a line is put into a system that converts the request into an XML document. The document is then transferred via XSL, into a format that can be sent to the line provider, UUNet in our example. UUNet then adds line-specific information, transforming the request into yet another XML document, which is returned to the CLEC. This new document is passed on to the installation company with additional information about where the client is located. Upon installation, notes about whether or not the installation was successful are added to the document, which is transformed again via XSL and passed back to the original CLEC application. The beauty of this solution is that instead of multiple systems, each using vendor-specific formatting, the same set of XML APIs can be used at every step, allowing a standard interface for the XML data across the applications, systems, and even businesses.

One might add that if all the participants in the process subscribe to an industry-standard DTD, it would not even be necessary to transform the documents using XSL.

As this process proceeds, hard copies of documents will need to be printed out and signed to show that legally important stages in the transaction have been reached. This can be done by stylesheets written in XSL (Extensible Stylesheet Language). The stylesheet specifies the font, type size, and position of all the elements of the document. It can control a certain amount of reformatting: a long document might start with a list of contents generated by collecting the section headers and their page numbers. Different but similar stylesheets could produce the same document in a variety of different formats: HTML, PDF, WML (for WAP devices), even voice for the blind, or Braille.

Clearly the Web has to have something like XML, and sooner or later we will all be using it if we want to publish serious amounts of information. No one suggests that HTML will vanish overnight, because it is very suitable for small jobs, just as you wouldn't use a full-blown book-production software package to write a letter. The W3C is rebuilding HTML on an XML foundation, called XHTML, to facilitate that transition. For the moment, XML's use on the Web is more impending than actual, but it is growing rapidly. A few of the many vocabularies include the following:

For a huge list of vocabularies and supporting technologies, see the XML Cover Pages at http://xml.coverpages.com.

People supplying and exchanging information use XML as a medium that allows them to specify the meaning and the value of bits of information. Often several XML documents are merged to create a new output. In theory you can send the resulting XML and a CSS or XSLT stylesheet to a browser, and something will appear that can be read on a screen. However, in practice, few browsers will properly interpret XML. Microsoft Internet Explorer v5 and later offer some capability, while Opera Version 4 or later, Netscape 6 or later, and all of the Mozilla builds offer more control over the presentation of XML documents. Older browsers that appeared before XML's 1998 release have little idea what to do with the unfamiliar markup.

It would be nice if browsers did the conversion because that would shift the processing burden from the server to the client (and since we are buyers of server hardware, this is better). For the moment and possibly for a long time in the future, people who want to display XML data on the Web have to convert their pages to HTML (or perhaps PDF or some other format) by putting them through some more or less clever program. Although it is possible in principle to transform XML into, say, HTML by applying a stylesheet, the "applying" bit may not be so easy. You might have to write (but see later) a script in Perl to make the transformation. Clearly, this isn't something that every webmaster wants to do, and software to do the job properly is available as a "publishing framework." There are a number of contenders, but a package well suited to Apache users is Cocoon, which is produced under the auspices of the Apache XML project.

19.2 XML and Perl

Before you embark seriously on Cocoon, you might like to look at the FAQs (http://xml.apache.org/cocoon/faqs.html#faq-noant). This will give you some notion of the substantial size, complexity, and tentative condition of the intellectual arena in which you will operate.

If you don't feel quite up to embarking on the Java adventure (which seems to one of us (PL) comparable with trying to walk a straight line from New York to the South Pole), but you still need to get to grips with XML, there are a large number of Perl packages on CPAN (http://search.cpan.org/search?mode=module&query=xml), which might produce useful results much faster. The interface between Perl and Apache is covered in Chapter 16 and Chapter 17. Another option, also hosted by the Apache XML Project, is AxKit (http://axkit.org), a Perl package for transforming and presenting information stored in XML.

19.3 Cocoon

Go to http://xml.apache.org/cocoon/index.html for an introduction to Cocoon and a link to the download page. You will see that a number of mysterious entities are mentioned: Xerces, Xalan, FOP, Xang, SOAP. These are all subsidiary packages that are used to make up Cocoon. What you need of them is included with the Cocoon download and is guaranteed to work, even though they may not be the latest releases. This makes the file rather large, but saves problems with inconsistent versions.

If you are running Apache on a platform where support for JDK 1.2 is either missing or difficult, you may still find it useful to run an older version of Cocoon. The following sections document Cocoon 1.8 installation with JServ, as well as the more recent Cocoon 2.0.3, which uses Tomcat. Both source and binary versions are available for multiple platforms.

19.4 Cocoon 1.8 and JServ

If you are running Win32, download the zipped executable; if Unix, then download the sources. We got Cocoon-1.8.tar.gz, which was flagged as the latest distribution.

As usual, read the README file. It tells you that the documentation is in the .../docs subdirectory as .html files. What it might mention, but does not, is that these files are formatted using fixed-width tables for a wide screen and don't print out well if you want hard copy. They are not easy to read either, so more flexible versions, suitable for reading and printing, are in the .../docs.printer subdirectory. There is a snag, which appeared later: the printable files are completely different from the screen files and omit a crucial piece of information. Still, as the reader will have gathered, this is normal stuff in the world of Java.

What follows is a minimum version of the installation process.

It seemed sensible to read install.html. Since Cocoon is a Java servlet, albeit rather a large one, you need a Java virtual machine, v1.1 or better. We had v1.1.8. If you have v1.2 or better, you need to treat the file <jdk_home>/lib/tools.jar, which contains the Java compiler, as a Cocoon component and include it in your classpath. This meant editing .login again (see Chapter 18) to include:

setenv CLASSPATH "/usr/src/java/jdk1.1.8/lib/tools.jar:."

We have to make Cocoon and all its bits visible to JServ by editing the file /usr/local/bin/etc/jserv.properties. The Cocoon documentation suggests that you add the lines:

wrapper.classpath=/usr/local/java/jdk1.1.8/lib/classes.zip
wrapper.classpath=/usr/src/cocoon/bin/cocoon.jar
wrapper.classpath=/usr/src/cocoon/lib/xerces_1_2.jar
wrapper.classpath=/usr/src/cocoon/lib/xalan_1_2_D02.jar
wrapper.classpath=/usr/src/cocoon/lib/fop_0_13_0.jar

Of course these paths were not correct for our machine. In JDK 1.1.8 there is no tools.jar, so we used classes.zip. Do not add servlet_2_2.jar, or Cocoon will not work. You should find a location in the jserv.properties file that already deals with "wrappers," so that would be a good place for it.
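
Since wrong paths in jserv.properties only show up later as obscure Java errors, it can save time to check them first. The following Python sketch lists any wrapper.classpath entries that do not exist on disk. It is our own helper, not part of Cocoon or JServ, and it assumes the simple key=value layout shown above.

```python
import os


def classpath_entries(properties_text):
    """Return the values of all wrapper.classpath lines, skipping comments."""
    entries = []
    for line in properties_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "wrapper.classpath":
            entries.append(value.strip())
    return entries


def missing_entries(properties_text):
    """Return the classpath entries that do not exist on this machine."""
    return [p for p in classpath_entries(properties_text) if not os.path.exists(p)]


# Excerpt in the style of jserv.properties, using two of the paths above.
SAMPLE = """\
# Cocoon classes
wrapper.classpath=/usr/src/cocoon/bin/cocoon.jar
wrapper.classpath=/usr/src/cocoon/lib/xerces_1_2.jar
"""

if __name__ == "__main__":
    for path in missing_entries(SAMPLE):
        print("missing:", path)
```

To check a real installation, read the text of your own jserv.properties file and pass it to missing_entries in place of SAMPLE.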

Next, we are told:

At this point, you must set the Cocoon configuration. To do this, you must choose the servlet zone(s) where you want Cocoon to reside. If you don't know what a servlet zone is, open the zone.properties file.

We opened /usr/local/bin/etc/zone.properties. The file has a lot of technical comments in it, which would make sense if you knew all about the subject. It would be overstating things to say that we instantly learned what a "servlet zone" is. The instructions go on to say that we should add the line:

servlet.org.apache.cocoon.Cocoon.initArgs=properties=[path to cocoon]/bin/cocoon.properties

As is normal with anything to do with Java, the advice is not quite accurate. There was no .../bin/cocoon.properties in the download. The file appeared (identically, as tested by the Unix utility diff) in two other locations, so we copied one of them to /usr/local/bin/etc (where all the other configuration files are) and added the line:

servlet.org.apache.cocoon.Cocoon.initArgs=properties=/usr/local/bin/etc/cocoon.properties

at the bottom of the zone.properties file.

Finally, we had to attack the jserv.conf file. We set ApJServLogFile to DISABLED, which sends JServ errors to the Apache error_log file. We were also told to add the lines:

AddHandler cocoon xml
Action cocoon /servlet/org.apache.cocoon.Cocoon

where "/servlet/ is the mount point of your servlet zone (and the above is the standard name for servlet mapping for Apache JServ)."

These are, of course, Apache directives, operative because the file jserv.conf is included in the site's Config file. It was not very clear what this was trying to say, but we copied these two lines literally into jserv.conf, within the <IfModule mod_jserv.c> block.

Apache started cleanly (check the error log), but an attempt to access http://www.butterthlies.com/index.xml produced the browser message:

Publishing Engine could not be initialized.
java.lang.RuntimeException: Can't create store repository: ./repository. Make sure 
it's there or you have writing permissions.
In case this path is relative we highly suggest you to change this to an absolute path 
so you can control its location directly and provide valid access rights.
      at org.apache.cocoon.processor.xsp.XSPProcessor.init(XSPProcessor.java:194)
....

Since the "repository" is defined in zone.properties as:

repositories=/usr/local/bin/servlets

the problem didn't seem to be a relative path, so it was presumably the write permission. We changed this by going up a directory and executing:

chmod a+w servlets

After a restart of Apache, this produced the same browser error. After further research, it appeared that, in true Java fashion, there were at least two completely different things called the "repository." The one that seemed to be giving trouble was specified in cocoon.properties by the line:

processor.xsp.repository=./repository

We changed it to:

processor.xsp.repository=/usr/local/bin/etc/repository

and applied:

chmod a+w repository

This solved the Engine initialization problem, but only to reveal a new one:

java.lang.RuntimeException: Error creating org.apache.cocoon.processor.xsp.XSPProcessor: make sure the needed classes can be found in the classpath (org/apache/turbine/services/resources/TurbineResourceService)
...

This stopped us for a while. We looked in the configuration files for some command involving a "turbine" in the hope of commenting it out and failed to find any. Then we noticed that in cocoon.properties the word "turbine" appeared in comments near a block of commands clearly involving database stuff. Perhaps, we thought, the problem was not that "turbine" should be deleted, but that something else in Cocoon wanted a "turbine," even though there was no database to interface to, and couldn't get it. We found a file /usr/src/cocoon/lib/turbine-pool.jar and added the line:

wrapper.classpath=/usr/src/cocoon/lib/turbine-pool.jar

to /usr/local/bin/etc/jserv.properties.

To our surprise Cocoon then started working. To be fair, the unprintable original installation instructions did mention turbine-pool.jar and said it was essential. However, the printable version, which we used, did not.

When you wrestle with this stuff, you will probably find that you have to restart Apache several times to activate changes in the Cocoon setup files. You may find that you get entries in the error_log:

... Address already in use: make_sock: could not bind to port 80

This is caused by restarting Apache while the old version is still running. Even though the JServ component may have failed, Apache itself probably has not and won't run twice binding to the same port. You need to kill and restart it each time you change anything in Cocoon.
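
If you are not sure whether an old Apache is still holding the port, a quick probe will tell you. The sketch below is a generic TCP check written in Python, not an Apache tool. It reports only that something is accepting connections on the port, not what that something is.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success and an error number otherwise.
        return probe.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use(80):
        print("port 80 is taken: stop the old Apache before starting a new one")
    else:
        print("port 80 is free")
```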

19.5 Cocoon 2.0.3 and Tomcat

Cocoon 2.0.3 is pretty completely self-contained. The collection of classes in Cocoon and Tomcat has been tuned to avoid any conflicts, and installing Cocoon on an existing Tomcat installation involves adding one file to Tomcat and adding some directives to httpd.conf. As Java installations go, this one is quite friendly.

Unless you have a strong need to customize Cocoon directly, by far the easiest way to install Cocoon is to download the binary distribution, in this case from http://xml.apache.org/dist/cocoon/. Installing Cocoon on Tomcat 3.3 or 4.0 (with the exception of 4.03, for which you should read the docs about some CLASSPATH issues) requires unzipping the distribution file and copying the cocoon.war file into the /webapps directory of the Tomcat installation and restarting Tomcat. When Tomcat restarts, it will find the new file, expand it into a cocoon directory, and configure itself to support Cocoon. (Once this is done, you can delete the cocoon.war file.)

If you've left Tomcat running its independent server, you can test whether Cocoon is running by firing up a browser and visiting http://localhost:8080/cocoon on your server. You should see the welcome screen for Cocoon. To move beyond using Tomcat by itself (which is fairly slow, though useful for testing), you have two options, depending on which Apache module you use to connect the Apache server to Tomcat.

The older (but in some ways more capable) option is to use mod_jk, as described in Chapter 18. If you are using mod_jk, you can connect the Cocoon examples to Apache quite simply by adding the directive:

JkMount /cocoon/* ajp12

to your httpd.conf file and restarting Apache. mod_jk is designed to support general integration of Java Servlets and Java Server Pages with Apache and provides finer-grained control over how Apache calls on these facilities. mod_jk also provides support for Apache's load-balancing facilities.

The newer approach uses mod_webapp, a module that seems more focused on simple connections between the Apache server and particular applications. mod_webapp comes with Tomcat 4.0 and higher, and you can find binary and RPM releases as well as source at http://jakarta.apache.org/builds/jakarta-tomcat-connectors/webapp/release/v1.2.0/. mod_webapp provides far fewer options, but it can connect Cocoon to Apache quickly and cleanly. You can either download a binary distribution or download a source distribution and compile it, and then copy the mod_webapp.so file to your Apache module folder. Once you've done that, you'll need to tell Apache to use mod_webapp for requests to /cocoon. Adding the following lines to your httpd.conf file should do the trick:

# Load the mod_webapp module
LoadModule webapp_module libexec/mod_webapp.so

AddModule mod_webapp.c

# Creates a connection named "warpConn" between the web server and the servlet 
# container located on the "127.0.0.1" IP address and port "8008" using 
# the "warp" protocol
<IfModule mod_webapp.c>
WebAppConnection warpConn warp 127.0.0.1:8008

# Mount the "cocoon" web application found thru the "warpConn" connection 
# on the "/cocoon" URI 
WebAppDeploy  cocoon  warpConn  /cocoon
</IfModule>

Once you've restarted Apache, you'll be able to access Cocoon through Apache. (For more information on differences between mod_webapp and mod_jk and why you might want to choose one over the other, see http://www.mail-archive.com/[email protected]/msg26335.html.)

19.6 Testing Cocoon

While the Cocoon examples are a welcome way to see that the installation process has gone smoothly, you'll most likely want to get your own documents into the system. Unlike the other application-building tools covered in the last few chapters, most uses of Cocoon start with publishing information rather than interacting with users. The following demonstration provides a first step toward publishing your own information, though you'll need a book on XSLT to learn how to make the most of this.

We'll start with a simple XML document containing a test phrase:

<?xml version="1.0"?>
<phrase>
 testing, testing, 1... 2... 3...
</phrase>

Save this as test.xml in the main Cocoon directory. Next, we'll need an XSLT stylesheet, stored as test2html.xsl in the main Cocoon directory, to transform that "phrase" document into an HTML document:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="phrase">
 <html>
  <head><title><xsl:value-of select="." /></title></head>
  <body><h1><xsl:value-of select="." /></h1></body>
 </html>
</xsl:template>

</xsl:stylesheet>

This stylesheet creates an HTML document when it encounters the phrase element and uses the contents of the phrase element (referenced by <xsl:value-of select="." />, which returns the contents of the current context) to fill in the title of the HTML document, as well as a header in body content. What appeared once in the XML document will appear twice in the HTML result.
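
For readers who have not met XSLT before, the effect of this stylesheet can be imitated by hand. The Python sketch below is not XSLT and is not how Cocoon works internally. It simply builds the same HTML tree from the same phrase document, with the function name phrase_to_html being our own invention.

```python
import xml.etree.ElementTree as ET

SOURCE = """<?xml version="1.0"?>
<phrase>
 testing, testing, 1... 2... 3...
</phrase>
"""


def phrase_to_html(xml_text):
    """Build the HTML tree the stylesheet describes and return it as text."""
    phrase = ET.fromstring(xml_text)
    if phrase.tag != "phrase":
        raise ValueError("expected a <phrase> document")
    # Roughly what <xsl:value-of select="."/> yields: all text in the element.
    content = "".join(phrase.itertext())

    html = ET.Element("html")
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = content
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = content
    return ET.tostring(html, encoding="unicode", method="html")


if __name__ == "__main__":
    print(phrase_to_html(SOURCE))
```

As with the stylesheet, the phrase is read once and written twice, into the title and into the heading.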

We now have the pieces that Cocoon can use to generate HTML, but we still need to tell Cocoon that these parts have a purpose. Cocoon uses a site map, stored in the XML file sitemap.xmap, to manage all of its processing. Processing is defined using pipelines, which can be sophisticated combinations of stylesheets and code, but which in our case need to provide a home for an XML document and its XSLT transformation. By adding one map:pipeline element to the end of the map:pipelines element, we can add our test to the list of pipelines Cocoon will run.

  <map:pipeline>
    <map:match pattern="test" />
    <map:generate src="test.xml" />
    <map:transform src="test2html.xsl" />
    <map:serialize />
  </map:pipeline>

This pipeline will match any requests to "test" that Cocoon receives, which means that we'll see the results at http://localhost/cocoon/test. It will take the test.xml document, transform it using the test2html.xsl document, and then serialize the document for delivery using its standard HTML serializer. Once you save this file, Cocoon will be ready to display our test — there's no need to restart Cocoon, Tomcat, or Apache.
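
The match, generate, transform, serialize sequence can be pictured as a chain of functions selected by the request. The Python sketch below is a toy model of that idea only: the function names, the PIPELINES table, and the exact-match lookup are our assumptions and bear no relation to Cocoon's actual Java classes or its pattern matching.

```python
def generate(_previous):
    """Stand-in for <map:generate>: produce the source content."""
    return "testing, testing, 1... 2... 3..."


def transform(document):
    """Stand-in for <map:transform>: wrap the content as the stylesheet does."""
    return (
        f"<html><head><title>{document}</title></head>"
        f"<body><h1>{document}</h1></body></html>"
    )


def serialize(document):
    """Stand-in for <map:serialize>: encode the result for delivery."""
    return document.encode("utf-8")


# One entry per pipeline: the request pattern, then the stages to run in order.
PIPELINES = {"test": [generate, transform, serialize]}


def handle(request_path):
    """Run the pipeline that matches the request, or return None."""
    stages = PIPELINES.get(request_path)
    if stages is None:
        return None  # no pipeline matched this request
    result = None
    for stage in stages:
        result = stage(result)
    return result


if __name__ == "__main__":
    print(handle("test"))
    print(handle("no-such-page"))
```

Each stage sees only the output of the one before it, which is why stages can be swapped or added without disturbing the rest of the chain.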

Visiting http://localhost/cocoon/test with a browser shows off the result of the transformation. A close look at the source code reveals that Cocoon has been at work, and its HTML serializer even added some metacontent:

<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><title>testing, 
testing, 1... 2... 3...</title></head>
<body>
<h1>
 testing, testing, 1... 2... 3...
</h1>
</body></html>

This is a very small taste of Cocoon's capabilities, but this foundation demonstrates that you can use Cocoon in conjunction with Tomcat and Apache without having to make many changes to your Apache installation.

[1]  Brett McLaughlin, Java and XML (O'Reilly & Associates, Inc., 2001).